OCR binarization and image pre-processing for searching historical documents

Authors

  • Maya R. Gupta
  • Nathaniel P. Jacobson
  • Eric K. Garcia
Abstract

We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average. © 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
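
For readers unfamiliar with the classic Otsu method named in the abstract, the following is a minimal sketch of global Otsu thresholding as it is commonly described: choose the gray level that maximizes the between-class variance of the histogram. It is not the authors' implementation and does not cover their multiresolution variant. The function names `otsu_threshold` and `binarize`, the 8-bit gray range, and the convention that ink is marked 1 are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return a threshold t (0..255) maximizing between-class variance.

    `pixels` is an iterable of 8-bit gray values (an assumption of this sketch).
    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # light class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Binarize a 2-D list of gray values: 1 = ink (dark), 0 = background."""
    t = otsu_threshold(p for row in image for p in row)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A single global threshold like this tends to struggle when background brightness varies across a page, which is one reason locally adaptive and multiresolution variants are studied.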

Similar resources

Combining multiple thresholding binarization values to improve OCR output

For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates informati...

Binarization of Document Image

Document image binarization is performed in the preprocessing stage of document analysis and aims to segment the foreground text from the document background. A fast and accurate document image binarization technique is important for the ensuing document image processing tasks such as optical character recognition (OCR). Though document image binarization has been studied for many years, t...

Information Extraction from Historical Semi-Structured Handwritten Documents

In this paper, we describe our approach to extracting salient events such as birth and death records from historical French parish documents that contain free-form handwritten text. The challenges posed by these documents to the current state of the art in handwriting recognition and information extraction go well beyond the generic challenges in recognizing handwritten text such as style variatio...

Phase-Based Binarization of Ancient Document Images

The main defects present in historical documents are darkness, non-uniform clarification, bleed-through and faded characters. To remove these defects, a binarization method is used. In this paper, a phase-based binarization method is studied in which the phase of ancient document images is preserved. This method is divided into three steps: preprocessing, main binarization and post-processing. In prep...

Foreground-Background Regions Guided Binarization of Camera-Captured Document Images

Binarization is an important preprocessing step in several document image processing tasks. Handheld camera devices are now in widespread use and allow fast and flexible document image capture. However, they may produce degraded grayscale images, especially due to bad shading or non-uniform illumination. State-of-the-art binarization techniques, which are designed for scanned images, do not...

Journal:
  • Pattern Recognition

Volume 40

Publication date: 2007